Abstract

Real-world videos often have complex dynamics, and methods for generating
open-domain video descriptions should be sensitive to temporal structure and
allow both input (a sequence of frames) and output (a sequence of words) of
variable length. To approach this problem, we propose a novel end-to-end
sequence-to-sequence model to generate captions for videos. For this we exploit
recurrent neural networks, specifically LSTMs, which have demonstrated
state-of-the-art performance in image caption generation. Our LSTM model is
trained on video-sentence pairs and learns to associate a sequence of video
frames to a sequence of words in order to generate a description of the event
in the video clip. Our model is naturally able to learn the temporal structure
of the sequence of frames as well as the sequence model of the generated
sentences, i.e., a language model. We evaluate several variants of our model
that exploit different visual features on a standard set of YouTube videos and
two movie description datasets (M-VAD and MPII-MD).
Figure 1. Our S2VT approach performs video description using a sequence to
sequence model. It incorporates a stacked LSTM which first reads the sequence
of frames and then generates a sequence of words. The input visual sequence to
the model is comprised of RGB and/or optical flow CNN outputs.


1.  Introduction 

Describing visual content with natural language text has recently received
increased interest, especially describing images with a single sentence
[8, 5, 16, 18, 20, 23, 29, 40]. Video description has so far seen less
attention despite its important applications in human-robot interaction, video
indexing, and describing movies for the blind. While image description handles
a variable length output sequence of words, video description also has to
handle a variable length input sequence of frames. Related approaches to video
description have resolved variable length input by holistic video
representations [29, 28, 11], pooling over frames [39], or sub-sampling on a
fixed number of input frames [43]. In contrast, in this work we propose a
sequence to sequence model which is trained end-to-end and is able to learn
arbitrary temporal structure in the input sequence. Our model is sequence to
sequence in the sense that it reads in frames sequentially and outputs words
sequentially.

The problem of generating descriptions in open domain videos is difficult not
just due to the diverse set of objects, scenes, actions, and their attributes,
but also because it is hard to determine the salient content and describe the
event appropriately in context. To learn what is worth describing, our model
learns from video clips and paired sentences that describe the depicted events
in natural language. We use Long Short Term Memory (LSTM) networks [12], a type
of recurrent neural network (RNN) that has achieved great success on similar
sequence-to-sequence tasks such as speech recognition [10] and machine
translation [34]. Due to the inherent sequential nature of videos and language,
LSTMs are well-suited for generating descriptions of events in videos.

The main contribution of this work is to propose a novel model, S2VT, which
learns to directly map a sequence of frames to a sequence of words. Figure 1
depicts our model. A stacked LSTM first encodes the frames one by one, taking
as input the output of a Convolutional Neural Network (CNN) applied to each
input frame's intensity values. Once all frames are read, the model generates a
sentence word by word. The encoding and decoding of the frame and word
representations are learned jointly from a parallel corpus. To model the
temporal aspects of activities typically shown in videos, we also compute the
optical flow [2] between pairs of consecutive frames. The flow images are also
passed through a CNN and provided as input to the LSTM. Flow CNN models have
been shown to be beneficial for activity recognition [31, 8].

To our knowledge, this is the first approach to video description that uses a
general sequence to sequence model. This allows our model to (a) handle a
variable number of input frames, (b) learn and use the temporal structure of
the video and (c) learn a language model to generate natural, grammatical
sentences. Our model is learned jointly and end-to-end, incorporating both
intensity and optical flow inputs, and does not require an explicit attention
model. We demonstrate that S2VT achieves state-of-the-art performance on three
diverse datasets: a standard YouTube corpus (MSVD) [3] and the M-VAD [37] and
MPII Movie Description [28] datasets. Our implementation (based on the Caffe
[15] deep learning framework) is available on GitHub:
https://github.com/vsubhashini/caffe/tree/recurrent/examples/s2vt.

2.  Related  Work 

Early work on video captioning considered tagging videos with metadata [1] and
clustering captions and videos [14, 25, 42] for retrieval tasks. Several
previous methods for generating sentence descriptions [11, 19, 36] used a
two-stage pipeline that first identifies the semantic content (subject, verb,
object) and then generates a sentence based on a template. This typically
involved training individual classifiers to identify candidate objects, actions
and scenes. They then use a probabilistic graphical model to combine the visual
confidences with a language model in order to estimate the most likely content
(subject, verb, object, scene) in the video, which is then used to generate a
sentence. While this simplified the problem by detaching content generation and
surface realization, it requires selecting a set of relevant objects and
actions to recognize. Moreover, a template-based approach to sentence
generation is insufficient to model the richness of language used in human
descriptions, e.g., which attributes to use and how to combine them effectively
to generate a good description. In contrast, our approach avoids the separation
of content identification and sentence generation by learning to directly map
videos to full human-provided sentences, learning a language model
simultaneously conditioned on visual features.


Our models take inspiration from the image caption generation models in
[8, 40]. Their first step is to generate a fixed length vector representation
of an image by extracting features from a CNN. The next step learns to decode
this vector into a sequence of words composing the description of the image.
While any RNN can be used in principle to decode the sequence, the resulting
long-term dependencies can lead to inferior performance. To mitigate this
issue, LSTM models have been exploited as sequence decoders, as they are more
suited to learning long-range dependencies. In addition, since we are using
variable-length video as input, we use LSTMs as sequence to sequence
transducers, following the language translation models of [34].

In [39], LSTMs are used to generate video descriptions by pooling the
representations of individual frames. Their technique extracts CNN features for
frames in the video and then mean-pools the results to get a single feature
vector representing the entire video. They then use an LSTM as a sequence
decoder to generate a description based on this vector. A major shortcoming of
this approach is that this representation completely ignores the ordering of
the video frames and fails to exploit any temporal information. The approach in
[8] also generates video descriptions using an LSTM; however, they employ a
version of the two-step approach that uses CRFs to obtain semantic tuples of
activity, object, tool, and location and then use an LSTM to translate this
tuple into a sentence. Moreover, the model in [8] is applied to the limited
domain of cooking videos while ours is aimed at generating descriptions for
videos "in the wild".

Contemporaneous with our work, the approach in [43] also addresses the
limitations of [39] in two ways. First, they employ a 3-D convnet model that
incorporates spatio-temporal motion features. To obtain the features, they
assume videos are of fixed volume (width, height, time). They extract dense
trajectory features (HoG, HoF, MBH) [41] over non-overlapping cuboids and
concatenate these to form the input. The 3-D convnet is pre-trained on video
datasets for action recognition. Second, they include an attention mechanism
that learns to weight the frame features non-uniformly conditioned on the
previous word input(s) rather than uniformly weighting features from all frames
as in [39]. The 3-D convnet alone provides limited performance improvement, but
in conjunction with the attention model it notably improves performance. We
propose a simpler approach to using temporal information by using an LSTM to
encode the sequence of video frames into a distributed vector representation
that is sufficient to generate a sentential description. Therefore, our direct
sequence to sequence model does not require an explicit attention mechanism.

Another recent project [33] uses LSTMs to predict the future frame sequence
from an encoding of the previous frames. Their model is more similar to the
language translation model in [34], which uses one LSTM to encode the input
text into a fixed representation, and another LSTM to decode it into a
different language. In contrast, we employ a single LSTM that learns both
encoding and decoding based on the inputs it is provided. This allows the LSTM
to share weights between encoding and decoding.

Other related work includes [24, 8], which use LSTMs for activity
classification, predicting an activity class for the representation of each
image/flow frame. In contrast, our model generates captions after encoding the
complete sequence of optical flow images. Specifically, our final model is an
ensemble of the sequence to sequence models trained on raw images and optical
flow images.

3.  Approach 

We propose a sequence to sequence model for video description, where the input
is the sequence of video frames $(x_1, \ldots, x_n)$ and the output is the
sequence of words $(y_1, \ldots, y_m)$. Naturally, both the input and output
are of variable, potentially different, lengths. In our case, there are
typically many more frames than words.

In our model, we estimate the conditional probability of an output sequence
$(y_1, \ldots, y_m)$ given an input sequence $(x_1, \ldots, x_n)$, i.e.

$$p(y_1, \ldots, y_m \mid x_1, \ldots, x_n) \tag{1}$$

This problem is analogous to machine translation between natural languages,
where a sequence of words in the input language is translated to a sequence of
words in the output language. Recently, [6, 34] have shown how to effectively
attack this sequence to sequence problem with an LSTM Recurrent Neural Network
(RNN). We extend this paradigm to inputs comprised of sequences of video
frames, significantly simplifying prior RNN-based methods for video
description. In the following, we describe our model and architecture in
detail, as well as our input and output representation for video and sentences.

3.1.  LSTMs  for  sequence  modeling 

The  main  idea  to  handle  variable-length  input  and  output 
is  to  first  encode  the  input  sequence  of  frames,  one  at  a  time, 
representing  the  video  using  a  latent  vector  representation, 
and  then  decode  from  that  representation  to  a  sentence,  one 
word  at  a  time. 

Let us first recall the Long Short Term Memory RNN (LSTM), originally proposed
in [12]. Relying on the LSTM unit proposed in [44], for an input $x_t$ at time
step $t$, the LSTM computes a hidden/control state $h_t$ and a memory cell
state $c_t$ which is an encoding of everything the cell has observed until
time $t$:

$$
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \\
g_t &= \phi(W_{xg} x_t + W_{hg} h_{t-1} + b_g) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \phi(c_t)
\end{aligned}
\tag{2}
$$

where $\sigma$ is the sigmoidal non-linearity, $\phi$ is the hyperbolic tangent
non-linearity, $\odot$ represents the element-wise product with the gate value,
and the weight matrices denoted by $W_{ij}$ and biases $b_j$ are the trained
parameters.
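
As an illustration of the recursion in Equation 2, the following is a minimal sketch of a single LSTM update written with plain Python lists so that it has no dependencies. The function names and the `params` dictionary layout are assumptions made for this example and are not taken from the released Caffe implementation.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W_x, x, W_h, h, b):
    """Return W_x @ x + W_h @ h + b for list-of-lists matrices."""
    return [
        sum(wx * xj for wx, xj in zip(row_x, x))
        + sum(wh * hj for wh, hj in zip(row_h, h))
        + bk
        for row_x, row_h, bk in zip(W_x, W_h, b)
    ]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM update following Equation 2.

    params maps each gate name ("i", "f", "o", "g") to a tuple
    (W_x, W_h, b); this container layout is an assumption of the sketch.
    """
    pre = {k: affine(W_x, x_t, W_h, h_prev, b)
           for k, (W_x, W_h, b) in params.items()}
    i_t = [sigmoid(v) for v in pre["i"]]    # input gate
    f_t = [sigmoid(v) for v in pre["f"]]    # forget gate
    o_t = [sigmoid(v) for v in pre["o"]]    # output gate
    g_t = [math.tanh(v) for v in pre["g"]]  # candidate update
    # c_t = f_t * c_{t-1} + i_t * g_t (element-wise)
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    # h_t = o_t * tanh(c_t) (element-wise)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

Calling `lstm_step` repeatedly, feeding each returned `(h_t, c_t)` back in as `(h_prev, c_prev)`, unrolls the recurrence over a sequence.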

Thus, in the encoding phase, given an input sequence
$X = (x_1, \ldots, x_n)$, the LSTM computes a sequence of hidden states
$(h_1, \ldots, h_n)$. During decoding it defines a distribution over the output
sequence $Y = (y_1, \ldots, y_m)$ given the input sequence $X$ as

$$p(y_1, \ldots, y_m \mid x_1, \ldots, x_n) = \prod_{t=1}^{m} p(y_t \mid h_{n+t-1}, y_{t-1}) \tag{3}$$

where the distribution $p(y_t \mid h_{n+t})$ is given by a softmax over all of
the words in the vocabulary (see Equation 5). Note that $h_{n+t}$ is obtained
from $h_{n+t-1}, y_{t-1}$ based on the recursion in Equation 2.
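
Taking logarithms turns the product in Equation 3 into a sum of per-step log-probabilities. The short sketch below computes that sum from precomputed per-step distributions; the function name and the list-based inputs are assumptions chosen for illustration.

```python
import math


def sequence_log_prob(step_distributions, word_ids):
    """Log of Equation 3: sum over t of log p(y_t | h_{n+t-1}, y_{t-1}).

    step_distributions[t] is the softmax output at decoding step t
    (a list of probabilities over the vocabulary); word_ids[t] is the
    vocabulary index of the word y_t.
    """
    if len(step_distributions) != len(word_ids):
        raise ValueError("need exactly one distribution per output word")
    return sum(math.log(dist[w])
               for dist, w in zip(step_distributions, word_ids))
```

Working in log space avoids the numerical underflow that multiplying many small probabilities would cause for longer sentences.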

3.2.  Sequence  to  sequence  video  to  text 

Our approach, S2VT, is depicted in Figure 2. While [6, 34] first encode the
input sequence to a fixed length vector using one LSTM and then use another
LSTM to map the vector to a sequence of outputs, we rely on a single LSTM for
both the encoding and decoding stage. This allows parameter sharing between the
encoding and decoding stage.

Our model uses a stack of two LSTMs with 1000 hidden units each. Figure 2 shows
the LSTM stack unrolled over time. When two LSTMs are stacked together, as in
our case, the hidden representation ($h_t$) from the first LSTM layer (colored
red) is provided as the input ($x_t$) to the second LSTM (colored green). The
top LSTM layer in our architecture is used to model the visual frame sequence,
and the next layer is used to model the output word sequence.

Training and Inference. In the first several time steps, the top LSTM layer
(colored red in Figure 2) receives a sequence of frames and encodes them while
the second LSTM layer receives the hidden representation ($h_t$) and
concatenates it with null padded input words (zeros), which it then encodes.
There is no loss during this stage when the LSTMs are encoding. After all the
frames in the video clip are exhausted, the second LSTM layer is fed the
beginning-of-sentence (<BOS>) tag, which prompts it to start decoding its
current hidden representation into a sequence of words. While training in the
decoding stage, the model maximizes the log-likelihood of the predicted output
sentence given the hidden representation of the visual frame sequence and the
previous words it has seen. From Equation 3, for a model with parameters
$\theta$ and output sequence $Y = (y_1, \ldots, y_m)$, this is formulated as:

$$\theta^{*} = \arg\max_{\theta} \sum_{t=1}^{m} \log p(y_t \mid h_{n+t-1}, y_{t-1}; \theta) \tag{4}$$

This log-likelihood is optimized over the entire training dataset using
stochastic gradient descent. The loss is computed only when the LSTM is
learning to decode. Since this loss is propagated back in time, the LSTM learns
to generate an appropriate hidden state representation ($h_n$) of the input
sequence. The output ($z_t$) of the second LSTM layer is used to obtain the
emitted word ($y$). We apply a softmax function to get the probability
distribution over the words $y'$ in the vocabulary $V$:

$$p(y \mid z_t) = \frac{\exp(W_y z_t)}{\sum_{y' \in V} \exp(W_{y'} z_t)} \tag{5}$$

Figure 2. We propose a stack of two LSTMs that learn a representation of a
sequence of frames in order to decode it into a sentence that describes the
event in the video. The top LSTM layer (colored red) models visual feature
inputs. The second LSTM layer (colored green) models language given the text
input and the hidden representation of the video sequence. We use <BOS> to
indicate begin-of-sentence and <EOS> for the end-of-sentence tag. Zeros are
used as a <pad> when there is no input at the time step.
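
The two-stage schedule described in this section (frames first, then words, with the loss active only while decoding) can be sketched as a per-time-step layout. This is an illustrative sketch only: the function name, the integer `pad_id` standing in for the zero word vector, and the returned tuple are assumptions, not the layout used by the released code.

```python
def s2vt_schedule(frame_feats, caption_ids, bos_id, eos_id, pad_id=0):
    """Lay out per-time-step inputs for the two LSTM layers.

    Returns (visual_inputs, word_inputs, targets, loss_mask), each of
    length n + m + 1 for n frames and m caption words.
    """
    n = len(frame_feats)
    zero_frame = [0.0] * len(frame_feats[0])
    decoder_in = [bos_id] + list(caption_ids)   # <BOS>, y_1, ..., y_m
    decoder_out = list(caption_ids) + [eos_id]  # y_1, ..., y_m, <EOS>

    # Encoding stage: real frames for layer 1, padded words for layer 2,
    # and no loss.
    visual_inputs = [list(f) for f in frame_feats]
    word_inputs = [pad_id] * n
    targets = [pad_id] * n
    loss_mask = [0] * n

    # Decoding stage: zero frames for layer 1, previous word for layer 2,
    # and the loss switched on.
    visual_inputs += [list(zero_frame) for _ in decoder_in]
    word_inputs += decoder_in
    targets += decoder_out
    loss_mask += [1] * len(decoder_in)
    return visual_inputs, word_inputs, targets, loss_mask
```

For three frames and a two-word caption this yields six time steps, with the loss mask equal to `[0, 0, 0, 1, 1, 1]`.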


We note that, during the decoding phase, the visual frame representation for
the first LSTM layer is simply a vector of zeros that acts as padding input. We
require an explicit end-of-sentence tag (<EOS>) to terminate each sentence
since this enables the model to define a distribution over sequences of varying
lengths. At test time, during each decoding step we choose the word $y_t$ with
the maximum probability after the softmax (from Equation 5) until it emits the
<EOS> token.
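
The test-time procedure above amounts to greedy decoding. A minimal sketch follows, assuming a hypothetical callback `step_fn(prev_word_id, state)` that stands in for one decoding step of the trained network and returns the scores $W_y z_t$ together with the updated state. The `max_len` cap is a safety limit added for this example; the text does not specify one.

```python
import math


def softmax(scores):
    """Numerically stable softmax, as in Equation 5."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_decode(step_fn, state, bos_id, eos_id, max_len=50):
    """Emit the most probable word at every step until <EOS> appears.

    step_fn(prev_word_id, state) -> (scores, new_state) is a hypothetical
    stand-in for one decoding step of the network.
    """
    words, prev = [], bos_id
    for _ in range(max_len):
        scores, state = step_fn(prev, state)
        probs = softmax(scores)
        prev = max(range(len(probs)), key=probs.__getitem__)
        if prev == eos_id:
            break
        words.append(prev)
    return words
```

Subtracting the maximum score before exponentiating leaves the softmax unchanged but prevents overflow for large scores.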


3.3.  Video  and  text  representation 

RGB frames. Similar to previous LSTM-based image captioning efforts [8, 40] and
video-to-text approaches [39, 43], we apply a convolutional neural network
(CNN) to input images and provide the output of the top layer as input to the
LSTM unit. In this work, we report results using the output of the fc7 layer
(after applying the ReLU non-linearity) on the Caffe Reference Net (a variant
of AlexNet) and also the 16-layer VGG model [31]. We use CNNs that are
pre-trained on the 1.2M image ILSVRC-2012 object classification subset of the
ImageNet dataset [30] and made available publicly via the Caffe ModelZoo.¹ Each
input video frame is scaled to 256x256, and is cropped to a random 227x227
region. It is then processed by the CNN. We remove the original last
fully-connected classification layer and learn a new linear embedding of the
features to a 500 dimensional space. The lower dimension features form the
input ($x_t$) to the first LSTM layer. The weights of the embedding are learned
jointly with the LSTM layers during training.

¹ https://github.com/BVLC/caffe/wiki/Model-Zoo

Optical Flow. In addition to CNN outputs from raw image (RGB) frames, we also
incorporate optical flow measures as input sequences to our architecture.
Others [24, 8] have shown that incorporating optical flow information to LSTMs
improves activity classification. As many of our descriptions are activity
centered, we explore this option for video description as well. We follow the
approach in [8, ] and first extract classical variational optical flow features
[2]. We then create flow images (as seen in Figure 1) in a manner similar to
[9], by centering x and y flow values around 128 and multiplying by a scalar
such that flow values fall between 0 and 255. We also calculate the flow
magnitude and add it as a third channel to the flow image. We then use a CNN
[9] initialized with weights trained on the UCF101 video dataset to classify
optical flow images into 101 activity classes. The fc6 layer activations of the
CNN are embedded in a lower 500 dimensional space which is then given as input
to the LSTM. The rest of the LSTM architecture remains unchanged for flow
inputs.
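
To make the flow-image construction concrete, here is a rough sketch that maps per-pixel flow to a three-channel image. The `scale` value, the clipping to [0, 255], and the way the magnitude channel is scaled are all assumptions made for illustration; the text only states that values are centered around 128 and scaled to fall between 0 and 255.

```python
import math


def flow_to_image(flow_x, flow_y, scale=16.0):
    """Convert per-pixel flow into a 3-channel image with values in [0, 255].

    flow_x, flow_y: 2-D lists of horizontal and vertical displacements.
    scale is a hypothetical multiplier, and clipping out-of-range values
    is an assumption of this sketch.
    """
    def to_byte(v):
        return int(min(255, max(0, round(v))))

    image = []
    for row_x, row_y in zip(flow_x, flow_y):
        out_row = []
        for fx, fy in zip(row_x, row_y):
            magnitude = math.hypot(fx, fy)
            out_row.append((
                to_byte(128 + scale * fx),   # channel 1: x flow around 128
                to_byte(128 + scale * fy),   # channel 2: y flow around 128
                to_byte(scale * magnitude),  # channel 3: flow magnitude
            ))
        image.append(out_row)
    return image
```

With this encoding a pixel with no motion maps to (128, 128, 0), so static regions are mid-gray in the first two channels.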

In our combined model, we use a shallow fusion technique to integrate flow and
RGB features. At each time step of the decoding phase, the model proposes a set
of candidate words. We then rescore these hypotheses with the weighted sum of
the scores by the flow and RGB networks, where we only need to recompute the
score of each new word $p(y_t = y')$ as:

$$\alpha \cdot p_{rgb}(y_t = y') + (1 - \alpha) \cdot p_{flow}(y_t = y')$$

where the hyper-parameter $\alpha$ is tuned on the validation set.
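
The rescoring step is a convex combination of the two networks' scores. The sketch below assumes each network's candidate scores are available as a dictionary from word to probability; the function names and the restriction of the weight to [0, 1] are assumptions of this example.

```python
def fuse_scores(p_rgb, p_flow, alpha):
    """Return alpha * p_rgb + (1 - alpha) * p_flow for each candidate word."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is assumed to lie in [0, 1]")
    if p_rgb.keys() != p_flow.keys():
        raise ValueError("both networks must score the same candidates")
    return {w: alpha * p_rgb[w] + (1.0 - alpha) * p_flow[w] for w in p_rgb}


def best_word(p_rgb, p_flow, alpha):
    """Pick the candidate with the highest fused score."""
    fused = fuse_scores(p_rgb, p_flow, alpha)
    return max(fused, key=fused.get)
```

Setting the weight to 1 recovers the RGB-only prediction and setting it to 0 recovers the flow-only prediction, which is a convenient sanity check when tuning on validation data.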

                     MSVD       MPII-MD    MVAD
#-sentences          80,827     68,375     56,634
#-tokens             567,874    679,157    568,408
vocab                12,594     21,700     18,092
#-videos             1,970      68,337     46,009
avg. length          10.2s      3.9s       6.2s
#-sents per video    ≈41        1          1-2

Table 1. Corpus Statistics. The number of tokens in all datasets is comparable;
however, MSVD has multiple descriptions for each video while the movie corpora
(MPII-MD, MVAD) have a large number of clips with a single description each.
Thus, the number of video-sentence pairs in all 3 datasets is comparable.

Text input. The target output sequence of words is represented using one-hot
vector encoding (1-of-N coding, where N is the size of the vocabulary). Similar
to the treatment of frame features, we embed words to a lower 500 dimensional
space by applying a linear transformation to the input data and learning its
parameters via back propagation. The embedded word vector concatenated with the
output ($h_t$) of the first LSTM layer forms the input to the second LSTM layer
(marked green in Figure 2). When considering the output of the LSTM we apply a
softmax over the complete vocabulary as in Equation 5.
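
The word pathway can be illustrated with a small sketch: a 1-of-N vector multiplied by an embedding matrix simply selects one row, and the result is concatenated with the first layer's hidden state. The function names and the row-per-word orientation of `W` are assumptions for this example.

```python
def one_hot(index, vocab_size):
    """1-of-N encoding of a vocabulary index."""
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec


def embed(one_hot_vec, W):
    """Linear map of a 1-of-N vector; W has one row per vocabulary word."""
    dim = len(W[0])
    return [sum(one_hot_vec[i] * W[i][j] for i in range(len(W)))
            for j in range(dim)]


def second_layer_input(word_index, h_first_layer, W):
    """Concatenate the embedded word with the first layer's output h_t."""
    return embed(one_hot(word_index, len(W)), W) + list(h_first_layer)
```

In practice the multiplication is usually replaced by a direct row lookup, which gives the same result without touching the zero entries.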

4.  Experimental  Setup 

This section describes the evaluation of our approach. We first describe the
datasets used, then the evaluation protocol, and then the details of our
models.

4.1.  Video  description  datasets 

We report results on three video description corpora, namely the Microsoft
Video Description corpus (MSVD) [3], the MPII Movie Description Corpus
(MPII-MD) [28], and the Montreal Video Annotation Dataset (M-VAD) [37].
Together they form the largest parallel corpora with open domain video and
natural language descriptions. While MSVD is based on web clips with short
human-annotated sentences, MPII-MD and M-VAD contain Hollywood movie snippets
with descriptions sourced from script data and audio description. Statistics of
each corpus are presented in Table 1.

4.1.1  Microsoft  Video  Description  Corpus  (MSVD) 

The Microsoft Video Description corpus [3] is a collection of YouTube clips
collected on Mechanical Turk by requesting workers to pick short clips
depicting a single activity. The videos were then used to elicit single
sentence descriptions from annotators. The original corpus has multi-lingual
descriptions; in this work we use only the English descriptions. We do minimal
pre-processing on the text by converting all text to lower case, tokenizing the
sentences and removing punctuation. We use the data splits provided by [39].
Additionally, in each video, we sample every tenth frame as done by [39].
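
The pre-processing steps listed above are simple enough to sketch with the Python standard library. The whitespace tokenizer, the restriction to ASCII punctuation, and the choice to start sampling at the first frame are assumptions of this sketch; the text does not specify them.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess_sentence(sentence):
    """Lower-case, strip ASCII punctuation, and split on whitespace."""
    return sentence.lower().translate(_PUNCT_TABLE).split()


def sample_frames(frames, step=10):
    """Keep every `step`-th frame, starting with the first one."""
    return frames[::step]
```

For example, `preprocess_sentence("A man is playing a guitar.")` yields the six lower-case tokens of the sentence with the final period removed.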

4.1.2  MPII  Movie  Description  Dataset  (MPII-MD) 

MPII-MD [28] contains around 68,000 video clips extracted from 94 Hollywood
movies. Each clip is accompanied by a single sentence description which is
sourced from movie scripts and audio description (AD) data. AD or Descriptive
Video Service (DVS) is an additional audio track that is added to the movies to
describe explicit visual elements in a movie for the visually impaired.
Although the movie snippets are manually aligned to the descriptions, the data
is very challenging due to the high diversity of visual and textual content,
and the fact that most snippets have only a single reference sentence. We use
the training/validation/test split provided by the authors and extract every
fifth frame (videos are shorter than MSVD, averaging 94 frames).

4.1.3  Montreal  Video  Annotation  Dataset  (M-VAD) 

The M-VAD movie description corpus [37] is another recent collection of about
49,000 short video clips from 92 movies. It is similar to MPII-MD, but contains
only AD data with automatic alignment. We use the same setup as for MPII-MD.

4.2.  Evaluation  Metrics 

Quantitative evaluation of the models is performed using the METEOR [7] metric
which was originally proposed to evaluate machine translation results. The
METEOR score is computed based on the alignment between a given hypothesis
sentence and a set of candidate reference sentences. METEOR compares exact
token matches, stemmed tokens, paraphrase matches, as well as semantically
similar matches using WordNet synonyms. This semantic aspect of METEOR
distinguishes it from others such as BLEU [26], ROUGE-L [2 ], or CIDEr [38].
The authors of CIDEr [38] evaluated these four measures for image description.
They showed that METEOR is always better than BLEU and ROUGE and outperforms
CIDEr when the number of references is small (CIDEr is comparable to METEOR
when the number of references is large). Since MPII-MD and M-VAD have only a
single reference, we decided to use METEOR in all our evaluations. We employ
METEOR version 1.5² using the code³ released with the Microsoft COCO Evaluation
Server [4].

² http://www.cs.cmu.edu/~alavie/METEOR
³ https://github.com/tylin/coco-caption

4.3. Experimental details of our models

All our models take as input either the raw RGB frames directly feeding into
the CNN, or pre-processed optical flow images (described in Section 3.3). In
all of our models, we unroll the LSTM to a fixed 80 time steps during training.
We found this to be a good trade-off between memory consumption and the ability
to provide many frames (videos) to the LSTM. This setting allows us to fit
multiple videos in a single mini-batch (up to 8 for AlexNet and up to 3 for
flow models). We note that 94% of the YouTube training videos satisfied this
limit (with frames sampled at the rate of 1 in 10). For videos with fewer than
80 time steps (of words and frames), we pad the remaining inputs with zeros.
For longer videos, we truncate the number of frames to ensure that the sum of
the number of frames and words is within this limit. At test time, we do not
constrain the length of the video and our model views all sampled frames. We
use the pre-trained AlexNet and VGG CNNs. For VGG, we fix all layers below fc7
to reduce memory consumption and allow faster training.
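
The padding and truncation rule for the fixed 80-step unroll can be sketched as follows. Dropping frames from the end of the clip is an assumption of this example, since the text does not say which frames are removed, and the function name is hypothetical.

```python
def fit_to_unroll(frames, words, max_steps=80):
    """Truncate frames so that frames + words fits in max_steps.

    `words` should already include any <BOS>/<EOS> tokens that occupy a
    time step. Returns the kept frames and the number of zero-padded
    steps needed to reach max_steps.
    """
    if len(words) >= max_steps:
        raise ValueError("caption alone exceeds the unroll length")
    kept = frames[: max_steps - len(words)]
    n_pad = max_steps - len(kept) - len(words)
    return kept, n_pad
```

A clip with 100 sampled frames and a 10-token caption keeps 70 frames with no padding, while a 30-frame clip with the same caption keeps all frames and pads 40 steps.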

We compare our sequence to sequence LSTM architecture with RGB image features
extracted from both AlexNet and the 16-layer VGG network. In order to compare
features from the VGG network with previous models, we include the performance
of the mean-pooled model proposed in [39] using the output of the fc7 layer
from the 16-layer VGG as a baseline (line 3, Table 2). All our sequence to
sequence models are referenced in Table 2 under S2VT. Our first variant, RGB
(AlexNet), is the end-to-end model that uses AlexNet on RGB frames. Flow
(AlexNet) refers to the model that is obtained by training on optical flow
images. RGB (VGG) refers to the model with the 16-layer VGG model on RGB image
frames. We also experiment with randomly re-ordered input frames (line 10) to
verify that S2VT learns temporal-sequence information. Our final model is an
ensemble of the RGB (VGG) and Flow (AlexNet) models, where the prediction at
each time step is a weighted average of the predictions from the individual
models.

4.4.  Related  approaches 

We compare our sequence to sequence models against the factor graph model (FGM)
in [36], the mean-pooled models in [39] and the Soft-Attention models of [43].

FGM proposed in [36] uses a two-step approach to first obtain confidences on
subject, verb, object and scene (S,V,O,P) elements and combines these with
confidences from a language model using a factor graph to infer the most likely
(S,V,O,P) tuple in the video. It then generates a sentence based on a template.

The  Mean  Pool  model  proposed  in  [39]  pools  AlexNet  fc7 
activations  across  all  frames  to  create  a  fixed-length  vector 
representation  of  the  video.  It  then  uses  an  LSTM  to  then 
decode  the  vector  into  a  sequence  of  words.  Further,  the 
model  ia  pre-trained  on  the  Flickr30k  [13]  and  MSCOCO 
[2  ]  image-caption  datasets  and  fine-tuned  on  MSVD  for 
a  significant  improvement  in  performance.  We  compare 


Model 

METEOR 

FGM  [3  ] 

23.9 

(1) 

Mean  pool 

-  AlexNet  [39] 

26.9 

(2) 

-VGG 

27.7 

(3) 

-  AlexNet  COCO  pre-trained  [39] 

29.1 

(4) 

-  GoogleNet  [43] 

28.7 

(5) 

Temporal  attention 

-  GoogleNet  [43] 

29.0 

(6) 

-  GoogleNet  +  3D-CNN  [43] 

29.6 

(7) 

S2VT  (ours) 

-  Flow  (AlexNet) 

24.3 

(8) 

-  RGB  (AlexNet) 

27.9 

(9) 

-  RGB  (VGG)  random  frame  order 

28.2 

GO) 

-  RGB  (VGG) 

29.2 

CD 

-  RGB  (VGG)  +  Flow  (AlexNet) 

29.8 

(12) 

Table  2.  MSVD  dataset  (METEOR  in  %,  higher  is  better). 


our  models  against  their  basic  mean-pooled  model  and  their 
best  model  obtained  from  fine-tuning  on  Flickr30k  and 
COCO  datasets.  We  also  compare  against  the  GoogleNet 
[35]  variant  of  the  mean-pooled  model  reported  in  [42  ]. 

The Temporal-Attention model in [43] is a combination of
weighted attention over a fixed set of video frames with
input features from GoogleNet and a 3D-convnet trained on
HoG, HoF and MBH features from an activity classification
model.

5.  Results  and  Discussion 

This section discusses the results of our evaluation
shown in Tables 2, 4, and 5.

5.1.  MSVD  dataset 

Table 2 shows the results on the MSVD dataset. Rows
1 through 7 present related approaches and the rest are
variants of our S2VT approach. Our basic S2VT AlexNet
model on RGB video frames (line 9 in Table 2) achieves
27.9% METEOR and improves over the basic mean-pooled
model in [39] (line 2, 26.9%) as well as the VGG
mean-pooled model (line 3, 27.7%), suggesting that S2VT is a
more powerful approach. When the model is trained with
randomly-ordered frames (line 10 in Table 2), the score
drops to 28.2%, a full point below the same VGG model
trained on correctly ordered frames (line 11, 29.2%), which
indicates that the model benefits from exploiting temporal
structure.

Our S2VT model which uses flow images (line 8)
achieves only 24.3% METEOR but, when combined with our
VGG model, improves its performance from 29.2% (line 11)
to 29.8% (line 12). A reason for the low performance
of the flow model could be that optical flow features
even for the same activity can vary significantly with
context, e.g. ‘panda eating’ vs ‘person eating’. Also, the
model only receives very weak signals with regard to the
kind of activities depicted in YouTube videos. Some
commonly used verbs such as “play” are polysemous and can
refer to playing a musical instrument (“playing a guitar”) or
playing a sport (“playing golf”). However, integrating RGB
with Flow improves the quality of descriptions.

Edit-Distance   k = 0   k <= 1   k <= 2   k <= 3
MSVD            42.9    81.2     93.6     96.6
MPII-MD         28.8    43.5     56.4     83.0
M-VAD           15.6    28.7     37.8     45.0

Table 3. Percentage of generated sentences which match a
sentence of the training set with an edit (Levenshtein) distance of less
than 4. All values reported in percentage (%).

Our ensemble using both RGB and Flow (line 12, 29.8%)
performs slightly better than the best model proposed in [43],
temporal attention with GoogleNet + 3D-CNN (line 7, 29.6%).
The modest size of the improvement is likely due to the much
stronger 3D-CNN features, as the difference to GoogleNet
alone (line 6) suggests. Thus, the closest comparison
between the Temporal Attention Model [43] and S2VT is
arguably S2VT with VGG (line 11) vs. their GoogleNet-only
model (line 6).

Figure 3 shows descriptions generated by our model on
sample YouTube clips from MSVD. To assess the originality
of the generated sentences, we compute the Levenshtein
distance between each predicted sentence and the sentences
in the training set. From Table 3, for the MSVD corpus,
42.9% of the predictions are identical to some training
sentence, and another 38.3% can be obtained by inserting,
deleting or substituting one word from some sentence in the
training corpus. We note that many of the descriptions
generated are relevant.
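
The originality measure can be sketched as follows, assuming
sentences are lower-cased and split on whitespace (the tokenization
is an assumption; it is not specified above). `word_edit_distance`
is a standard word-level Levenshtein distance, and `match_rates`
reports, for each k, the percentage of predictions within distance k
of some training sentence, in the manner of Table 3. All names and
the toy sentences are illustrative.

```python
def word_edit_distance(a, b):
    """Levenshtein distance between two sentences, counted in words."""
    x, y = a.lower().split(), b.lower().split()
    prev = list(range(len(y) + 1))  # distance from empty prefix of x
    for i, wx in enumerate(x, start=1):
        curr = [i]
        for j, wy in enumerate(y, start=1):
            cost = 0 if wx == wy else 1
            curr.append(min(prev[j] + 1,          # delete wx
                            curr[j - 1] + 1,      # insert wy
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def match_rates(predictions, training_sentences, max_k=3):
    """Percentage of predictions within distance <= k of a training sentence."""
    if not predictions or not training_sentences:
        raise ValueError("need at least one prediction and one training sentence")
    nearest = [min(word_edit_distance(p, s) for s in training_sentences)
               for p in predictions]
    n = len(nearest)
    return {k: 100.0 * sum(d <= k for d in nearest) / n
            for k in range(max_k + 1)}


# Toy data (made up for illustration only).
train = ["a man is playing a guitar", "a cat runs fast"]
preds = ["a man is playing a guitar", "a dog runs"]
print(match_rates(preds, train))
```

The columns are cumulative: a prediction counted at k = 0 is also
counted at every larger k, which is why the 38.3% figure above is the
difference between the k <= 1 and k = 0 columns.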

5.2.  Movie  description  datasets 

For the more challenging MPII-MD and M-VAD
datasets we use our single best model, namely S2VT trained
on RGB frames and VGG. To avoid over-fitting on the
movie corpora we employ dropout, which has proved to be
beneficial on these datasets [27]. We found it was best to
use dropout at the inputs and outputs of both LSTM
layers. Further, we used ADAM [17] for optimization with a
first momentum coefficient of 0.9 and a second momentum
coefficient of 0.999. For MPII-MD, reported in Table 4,
we improve over the SMT approach of [28] from 5.6%
to 7.1% METEOR and over Mean pooling [39] by 0.4%.
Our performance is similar to Visual-Labels [27], a
contemporaneous LSTM-based approach which uses no temporal
encoding, but more diverse visual features, namely object
detectors, as well as activity and scene classifiers.
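
For reference, one ADAM update with the two momentum coefficients
quoted above can be written out as below. This is a sketch of the
standard update rule from [17] in plain Python; the learning rate
and `eps` are placeholder assumptions, since they are not stated
here, and the function name is illustrative.

```python
import math


def adam_step(params, grads, m, v, t,
              lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update over flat lists of floats.

    m, v: running first and second moment estimates (start at zeros).
    t: 1-based step counter, used for bias correction.
    lr and eps are assumed values, not taken from the text.
    """
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g        # first moment
        v_i = beta2 * v_i + (1.0 - beta2) * g * g    # second moment
        m_hat = m_i / (1.0 - beta1 ** t)             # bias-corrected
        v_hat = v_i / (1.0 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


# Toy example: two parameters, one step from zero-initialised moments.
params, m, v = adam_step([1.0, -2.0], [0.5, -4.0],
                         m=[0.0, 0.0], v=[0.0, 0.0], t=1, lr=0.1)
print(params)
```

On the first step the bias-corrected ratio is close to the sign of
the gradient, so each parameter moves by roughly `lr` regardless of
the gradient's magnitude.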

Approach                      METEOR
SMT (best variant) [28]        5.6
Visual-Labels [27]             7.0
Mean pool (VGG)                6.7
S2VT: RGB (VGG), ours          7.1

Table 4. MPII-MD dataset (METEOR in %, higher is better).

On M-VAD we achieve 6.7% METEOR, which
significantly outperforms the temporal attention model [43]
(4.3%)⁴ and Mean pooling (6.1%). On this dataset we also
outperform Visual-Labels [27] (6.3%).

Approach                                    METEOR
Visual-Labels [27]                           6.3
Temporal att. (GoogleNet+3D-CNN) [43]⁴       4.3
Mean pool (VGG)                              6.1
S2VT: RGB (VGG), ours                        6.7

Table 5. M-VAD dataset (METEOR in %, higher is better).

We report results on the LSMDC challenge⁵, which
combines M-VAD and MPII-MD. S2VT achieves 7.0%
METEOR on the public test set using the evaluation server.

In Figure 4 we present descriptions generated by our
model on some sample clips from the M-VAD dataset.
More example video clips, generated sentences, and data
are available on the authors’ webpages⁶.

6.  Conclusion 

This paper proposed a novel approach to video
description. In contrast to related work, we construct
descriptions using a sequence to sequence model, where frames
are first read sequentially and then words are generated
sequentially. This allows us to handle variable-length input
and output while simultaneously modeling temporal
structure. Our model achieves state-of-the-art performance on
the MSVD dataset, and outperforms related work on two
large and challenging movie-description datasets. Despite
its conceptual simplicity, our model significantly benefits
from additional data, suggesting that it has a high model
capacity, and is able to learn complex temporal structure
in the input and output sequences for challenging
movie-description datasets.

Correct descriptions.


S2VT: A man is doing stunts on his bike.
S2VT: A herd of zebras are walking in a field.
S2VT: A young woman is doing her hair.
S2VT: A man is shooting a gun at a target.

Relevant but incorrect descriptions.

S2VT: A small bus is running into a building.

Irrelevant descriptions.

S2VT: A man is pouring liquid in a pan.
S2VT: A man is spreading butter on a tortilla.
S2VT: A black clip to walking through a path.

Figure 3. Qualitative results on MSVD YouTube dataset from our S2VT model (RGB on VGG net). (a) Correct descriptions involving
different objects and actions for several videos. (b) Relevant but incorrect descriptions. (c) Descriptions that are irrelevant to the event in
the video.


Temporal Attention (GNet+3D-convatt):
(1) At night, SOMEONE and SOMEONE step into the parking lot.
(2) Now the van drives away.
(3) They drive away.
(4) They drive off.
(5) They drive off.
(6) At the end of the street, SOMEONE sits with his eyes closed.

S2VT (Ours):
(1) Now, the van pulls out a window and a tall brick facade of tall trees. a figure stands at a curb.
(2) Someone drives off the passenger car and drives off.
(3) They drive off the street.
(4) They drive off a suburban road and parks in a dirt neighborhood.
(5) They drive off a suburban road and parks on a street.
(6) Someone sits in the doorway and stares at her with a furrowed brow.

DVS:
(1) Now, at night, our view glides over a highway its lanes glittering from the lights of traffic below.
(2) Someone's suv cruises down a quiet road.
(3) Then turn into a parking lot.
(4) A neon palm tree glows on a sign that reads oasis motel.
(5) Someone parks his suv in front of some rooms.
(6) He climbs out with his briefcase, sweeping his cautious gaze around the area.

Figure 4. M-VAD Movie corpus: Representative frame from 6 contiguous clips from the movie “Big Mommas: Like Father, Like Son”.
From left: Temporal Attention (GoogleNet+3D-CNN) [43], S2VT (in blue) trained on the M-VAD dataset, and DVS: ground truth.